Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review {cfr} v0.1.0 for CRAN release #78

Merged
merged 60 commits into from
Sep 15, 2023
Merged

Review {cfr} v0.1.0 for CRAN release #78

merged 60 commits into from
Sep 15, 2023

Conversation

pratikunterwegs
Copy link
Collaborator

@pratikunterwegs pratikunterwegs commented Aug 21, 2023

This PR is for a full package review prior to a CRAN release, and is also working towards items in the release checklist in #68.

This PR will be used to collect and address reviwer comments. Either this branch, or issue-specific branches will be used to push changes. The changes will be merged into main prior to release.

The proposed timeline is for all reviews to be in by 8th September 2023. The goal is to have {cfr} released to CRAN soon after. This schedule is contingent on releasing {epiparameter} to CRAN.

I would encourage/request two reviewers to sign up to do the full review. Please tick a box and assign yourself in the Reviewers panel on the right to indicate that you have taken on the review.

  • Reviewer 1
  • Reviewer 2

Following review this PR also fixes #80 and fixes #81.

@pratikunterwegs
Copy link
Collaborator Author

Thanks to @abdoelnaser-mahmood-degoot for sending me these comments over Slack - posting them here for all others to read.

I went through the {cfr} package and have some comments that I'd want to share with you.

1- I see an entanglement between two arguments: correct_for_delays and epidist, across several functions in the package resulting from how you designed/implemented the code. If correct_for_delays is set to TRUE, then epidist must be specified. The question is, what are the chances that the user will provide a value for the epidist parameter while ignoring to correct for the delays? When a user enters a value for epidist, he or she implicitly intends to account for delays. The correct_for_delays parameter may aid in the design and implementation of the package, but it may confuse the user.
2- In the quickest example, remove the comma after the data argument on the rolling_cfr_naive function.
3- In the readme, it would be better to offer the reader a brief code chunk that generates the ggplot2 figure.
4- Include a link to the source Camacho et al. in the documentation of ebola1976 data.
5- Mention the data type of severity_baseline and severity_baseline in the documentation of the estimate_ascertainment function parameters, just as you did for the other parameters. Also, what is the default value for the type argument, and why does max_date accept a string instead of a Date?
6- The default value for poisson_threshold, in many functions, is 100 and not 200.
7- Is it correct that you set the default value for the epidist argument to NULL in the known_outcomes and estimate_ascertainment functions? Likewise, for severity_baseline to 0.014; is this true for all diseases?

@jamesmbaazam jamesmbaazam self-requested a review August 29, 2023 10:10
R/estimate_ascertainment.R Outdated Show resolved Hide resolved
R/estimate_ascertainment.R Outdated Show resolved Hide resolved
DESCRIPTION Outdated Show resolved Hide resolved
@pratikunterwegs
Copy link
Collaborator Author

Thanks @Bisaloo, will wait for other reivews to come in and move to fix in the week of 11th Sept.

Copy link
Member

@adamkucharski adamkucharski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good overall. There are few issues I had with pulling from epiparameter – it may be a version problem, but wanted to flag.

General comment: the documentation uses a mix of 'case fatality rate', 'case fatality risk' and 'case fatality ratio', sometimes within the same document (e.g. vignette). We should pick one for consistency. 'Case fatality rate' is most commonly used, even though it is technically inaccurate (the value calculated is a risk rather than a rate). I would therefore go with 'case fatality risk' (or 'case fatality ratio' if this requires less refactoring).

Similarly, 'ascertainment rate' isn't really a rate - perhaps 'ascertainment ratio' or 'ascertainment probability' would be better?

vignettes/cfr.Rmd Outdated Show resolved Hide resolved
vignettes/cfr.Rmd Outdated Show resolved Hide resolved
vignettes/cfr.Rmd Outdated Show resolved Hide resolved
vignettes/cfr.Rmd Outdated Show resolved Hide resolved
vignettes/cfr.Rmd Outdated Show resolved Hide resolved
vignettes/estimate_ascertainment.Rmd Show resolved Hide resolved
vignettes/estimate_ascertainment.Rmd Outdated Show resolved Hide resolved
vignettes/estimate_time_varying_severity.Rmd Show resolved Hide resolved
vignettes/estimate_time_varying_severity.Rmd Outdated Show resolved Hide resolved
vignettes/estimate_time_varying_severity.Rmd Outdated Show resolved Hide resolved
Copy link
Member

@jamesmbaazam jamesmbaazam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{cfr} full package review

I think {cfr} is in good shape and will be ready to be submitted to CRAN following this full package review. I have added general comments below as well as line-by-line comments on most of the changed files. Feel free to request my replies under the conversations I've started.

Routine automatic checks

The following routine checks were performed on the package and did not raise any issues.

  • devtools::check()
  • goodpractice::gp()
  • lintr::lint_package()
  • styler::style_pkg
  • covr::package_coverage()

README

General comments

  • The README looks fine to me. It's clear and concise, and all the badges are appropriate.

Improvements

  • I have added line specific comments to the README file.
  • I suggest to provide a brief overview of the core functions and what they do. This will help users to quickly understand what the package does and how to use it.

Functions

General comments

  • estimate_static():
    • One major comment is for you to add details on the methods and references for the CFR estimation. This is important for people who are not familiar with CFR and want to learn more about it. Same comment goes for known_outcomes().
    • Additionally, I would suggest adding a _cfr suffix to this function and others, i.e. estimate_static_cfr(). This will help users to quickly identify the functions that estimate CFRs.
  • known_outcomes():
  • estimate_ascertainment(): I would be interested to see the methods for this function as I am a little worried about its inner workings. For example, its results are highly sensitive to the severity_baseline argument, which is user supplied rather than estimated. I also have a little quibble about replacing all ascertainment ratios > 1 with 1. It seems hacky and I wonder why it is not dealt with by the model but rather as a post-processing step.

Improvements

  • I have added line specific comments to each function.

Vignettes

General comments

  • The vignettes lay out the core functionality in the package and are insightful. I have added line specific comments.
  • As a general comment, since vignettes are basically literate programming, I usually prefer not to use redundant phrases like "using the code below" or "the code below", "the following code", etc. The accompanying code should reinforce the written words.

Specific comments

  • "Estimating the proportion of cases that are ascertained during an outbreak":
    • Another way to visualize the results to remove the effect of the severity_baseline argument is to plot the relative ratios between the countries so that the baseline severity cancels out, ruling out its effect.
  • "Calculating a static, delay-adjusted estimate of disease severity":
    • I'm not convinced that Fig 3 and 7 are necessary. It's not clear what it adds to the text. I would suggest to remove it. It looks like something that belongs in {epiparameter} rather than {cfr}.
    • I would suggest moving the section "Details: Adjusting for delays between two time series" to a theory vignette like in bpmodels and the details section of the respective functions. To be honest, it looks good here and seems to fit the context. Another way to go about it would be to hyperlink to the theory vignette at the beginning and end of this vignette like you have for the Getting Started vignette.
    • I would suggest replacing the generic use of "known outcomes" with the specifics. I think in this vignette, it is the known death outcomes.
  • "Estimating how disease severity varies over the course of an outbreak":
    • Title seems long. Suggest shortening to, for example, "Estimating time-varying CFRs"
    • Same as before, I'm not convinced that Fig 3 is necessary. It's not clear what it adds to the text. I would suggest to remove it.
    • Suggest adding a legend to Fig 4 and 5 indicate what the grey and tomato lines/ribbons represent.
    • Also, what is going on before April in Figs 4 and 5? The ribbons look weird and are hard to interpret.
    • Refer to my earlier comment above on moving the section "Details: Adjusting for delays between two time series".

Improvements

  • I have added line specific comments to each vignette.

Other infrastructure

DESCRIPTION

Package version is missing the patch number, i.e., currently 0.1 instead of 0.1.0. I would suggest to addding it to conform with the rest of our packages.

Tests

  • Might be worthwhile to provide more descriptive contexts for the tests, for example, test_that("multiplication works", ....
  • I did not inspect the tests in detail so do not have detailed comments.

README.Rmd Outdated Show resolved Hide resolved
README.Rmd Outdated Show resolved Hide resolved
README.Rmd Outdated Show resolved Hide resolved
README.Rmd Outdated Show resolved Hide resolved
README.Rmd Outdated Show resolved Hide resolved
vignettes/estimate_static_severity.Rmd Outdated Show resolved Hide resolved
vignettes/estimate_time_varying_severity.Rmd Outdated Show resolved Hide resolved
vignettes/estimate_time_varying_severity.Rmd Outdated Show resolved Hide resolved
vignettes/estimate_time_varying_severity.Rmd Outdated Show resolved Hide resolved
vignettes/estimate_time_varying_severity.Rmd Show resolved Hide resolved
@joshwlambert
Copy link
Member

Full package review for {cfr}.

Overall the package looks good. It is functionally to the point which should help new users get to grips with the aims and applications of the package. The documentation is clearly written and provides a comprehensive overview of the functions and data requirements.

I like that the package is minimal in terms of namespace but contains a good amount of functionality. The inclusion of two datasets provides two clear use-cases for the package.

I've left some comments below. Some of these overlap with those already made by other reviewers but I thought I'd leave them in just to ensure they weren't missed.

Functions

  • Why does the severity_baseline argument in estimate_ascertainment() have a default of 0.014? This seems to me to be an argument you want to ensure the user specifies to prevent accidentally leaving the default value.
  • In estimate_ascertainment() is the apply() on L127 needed. From what I can see it is will always be a single row data.frame and can also have vectorised operations if it has more than one row.
  • In estimate_ascertainment() (and other functions) there seems to be an interaction of function arguments which is not documentation or caught by input checking. The setup appears to give type precedence so if type = "static" and an <epidist> is supplied it is assumed the user wants to use the "static" estimate. Not sure if this is a problem, and if it is I'm not sure I have a solution. One option (not sure how I feel about as it removes some of the clarity/verbosity of the function arguments) is to remove the type argument and instead control the flow of the function with is.null(epidist).
  • I think he first part of stopifnot() in estimate_rolling() (L61) is already checked by checkmate::assert_names() on L54-57
  • In estimate_static() is there a reason severity_estimate is assigned as an empty numeric? It seems that the if else that follows means that severity_estimate will always be assigned before the function returns
  • In estimate_time_varying() there are two input checks on correct_for_delays
  • The ordering of arguments seems to varying across the estimate_*() functions. Would be good to provide users with consistent argument ordering
  • estimate_time_varying() has a # TODO input checking comment but it seems that all input is checked
  • epidist is checked twice in input checking in estimate_time_varying()
  • In get_default_burn_in() it is possible to have an <epidist> object with a mean of NA. I would try and account for this. If you think it would be useful I can write an S3 method for mean() (thankfully it's a S3 generic in base) which can try and calculate a mean even if it is not given in the summary stats of the object.
  • The epidist default in known_outcomes is NULL. However, if an epidist is not supplied the function will error on L60 at assert_class(epidist, "epidist"). I would suggest removing the argument default.
  • The prepare_data.incidence2() function definition contains .... I suggest adding chkDots(...) to the function which will prevent users supplying arguments to be silently swallowed by .... I would also document that the dots are not used as currently the documentation is somewhat misleading and suggests these arguments will be used by the method.

Documentation

  • estimate_ascertainment() and estimate_time_varying() example fails due to changes in {epiparameter}. Adding single_epidist = TRUE to epidist_db() will solve the issue
  • It would be good to add the full reference of Camacho et al. (2014) to the ebola1976.R file, as you did for the covid_data.R file.
  • Check consistent use of imperative function description and full sentences (e.g. estimate_ascertainment.R)
  • Inconsistent use of data frame, data.frame & <data.frame> in documentation
  • Is estimate_rolling() using a method described or used in a paper, if so could it be referenced in the documentation?
  • known_outcomes can @inheritParams from estimate_static() for @param epidist
  • My assumption is that when users want to read the documentation for a method they use ?generic, in this case ?prepare_data, as this is the function they are interacting with and the dispatch takes place under the hood. I think the documentation for prepare_data.incidence2() is better and should be merged with the documentation for the generic so the user can find it all in one place. As this generic currently only has one method I would reword the documentation as it currently states: "it has methods for classes commonly found in epidemiological data", instead you can say that more methods for common epidemiological data classes will be added over time.
  • Due to documenting both the generic and the method for prepare_data() it appears twice in the package reference on the pkgdown site. If you want to keep both that is perfectly fine. I think the usual practise is to not document the method and just have the generic in the reference (e.g dplyr), but if you plan to add more methods you may want each of these to be shown. It can be quiet useful to have the class name in the function to show which classes the function applies to, in this case document the methods and just export the generic.
  • The code for the plot in the README is completely hidden. Is it worth having it hidden but with an option to reveal the code if users are interested in using it?
  • Is there a reason for the double asterisk in the figure caption of Fig 3 in estimate_static_severity.Rmd
  • Why does library(outbreaks) require a # nolint flag in the data_from_incidence2.Rmd vignette?

Testing

  • Testing coverage for all functions is high except for prepare_data() which has 72.41% coverage.

Package infrastructure

  • update GitHub action workflows to match those in {packagetemplate}
  • Add citation (.cff) file, CITATION file, and update-citation-cff.yml workflow, (this will require an addition to the Rbuildignore which is automatic if I remember correctly)
  • There are some folders in the .gitignore that are missing from the .gitignore from the {packagetemplate} and vice versa, would be good to check if these should be added to.
  • Update .lintr to match {packagetemplate}
  • In data-raw you save the ebola data with usethis::use_data(..., version = 3) and save the covid data with usethis::use_data(..., version = 2). Is there a reason for the difference?
  • A styler run updates a few files

Speculative idea: what if prepare_data() was moved internally to the estimate_*() functions? The estimate_*() functions remain normal (i.e. not S3 generics) but can accept multiple data types in data and prepate_data() does the work for the user. The way I see it the benefits to this approach are: 1) one less exported function (prepare_data() becomes internal); 2) less documentation around wrangling different data sources as the user does not need to know all the details. Downsides to this approach is potentially more arguments in the estimate_*() functions. But the current defaults of prepare_data() would match the requires documented to have a "cases" and a "deaths" column name.

@pratikunterwegs
Copy link
Collaborator Author

Thanks all for your feedback - I shall work on reconciling your suggestions and making changes this week. I would say the aim is to be ready for a CRAN submission as soon as possible, ideally within this week.

@jamesmbaazam
Copy link
Member

This is an interesting idea, and I think it would not take too much effort to implement. Still, I think it's good to have the conceptual separation between preparing data for CFR calculation, and the estimation step, reflected in the code for now. Happy to hear more thoughts on this from others.

The following linked issue in {serofoi} might be relevant to this conversation. @zmcucunuba also thinks that it makes sense to have two separate routines for preparing the data and modelling. See epiverse-trace/serofoi#61.

@pratikunterwegs
Copy link
Collaborator Author

Thanks again everyone for your feedback. The proposed and (some) scheduled changes have now been implemented. This PR will be retargeted to main and merged tomorrow morning. Please leave any final suggestions by then.

@Bisaloo
Copy link
Member

Bisaloo commented Sep 14, 2023

These functions now have the epidist argument set to missing by default, the previous default was NULL

What's the motivation for this change? NULL probably signals more clearly that this argument is optional.

This is discussed in the tidyverse design guide as well: https://design.tidyverse.org/required-no-defaults.html

@pratikunterwegs
Copy link
Collaborator Author

These functions now have the epidist argument set to missing by default, the previous default was NULL

What's the motivation for this change? NULL probably signals more clearly that this argument is optional.

This is discussed in the tidyverse design guide as well: https://design.tidyverse.org/required-no-defaults.html

This has been changed to NULL by default in a5cdca3

@pratikunterwegs pratikunterwegs changed the base branch from empty to main September 15, 2023 10:28
@pratikunterwegs pratikunterwegs merged commit 2b6171a into main Sep 15, 2023
@pratikunterwegs pratikunterwegs deleted the review branch September 15, 2023 13:44
pratikunterwegs pushed a commit that referenced this pull request Nov 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Get epidist mean using S3 method Account for changes to epiparameter::epidist_db()
6 participants